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Abstract 

Background: Multiple imputation is often used for missing data. When a model contains as covariates more than 
one function of a variable, it is not obvious how best to impute missing values in these covariates. Consider a 
regression with outcome V and covariates X and X^. In 'passive imputation' a value X* is imputed for X and then X^ 
is imputed as {X*)^. A recent proposal is to treat X^ as 'just another variable' (JAV) and impute X and X^ under 
multivariate normality. 

Methods: We use simulation to investigate the performance of three methods that can easily be implemented in 
standard software: 1) linear regression of Xon V to impute Xthen passive imputation of X^; 2) the same regression 
but with predictive mean matching (PMM); and 3) JAV. We also investigate the performance of analogous methods 
when the analysis involves an interaction, and study the theoretical properties of JAV. The application of the 
methods when complete or incomplete confounders are also present is illustrated using data from the EPIC Study. 

Results: JAV gives consistent estimation when the analysis is linear regression with a quadratic or interaction term 
and X is missing completely at random. When X is missing at random, JAV may be biased, but this bias is generally 
less than for passive imputation and PMM. Coverage for JAV was usually good when bias was small. However, in 
some scenarios with a more pronounced quadratic effect, bias was large and coverage poor. When the analysis 
was logistic regression, JAV's performance was sometimes very poor. PMM generally improved on passive 
imputation, in terms of bias and coverage, but did not eliminate the bias. 

Conclusions: Given the current state of available software, JAV is the best of a set of imperfect imputation 
methods for linear regression with a quadratic or interaction effect, but should not be used for logistic regression. 



Background 

In most medical and epidemiological studies some of 
the data that should have been collected are missing. 
This presents problems for the analysis of such data. 
One approach is to restrict the analysis to complete 
cases, i.e. those subjects for whom none of the variables 
in the analysis model are missing. Data are said to be 
missing completely at random (MCAR), missing at ran- 
dom (MAR) or missing not at random (MNAR) [1]. 
MCAR means that that the probability of the pattern of 
missing data being as it is depends on neither the 
observed nor the missing data. MAR is the weaker 
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condition that the probability does not depend on the 
missing data given the observed data. MNAR means 
that it depends also on the missing data. When data are 
MCAR, the complete cases constitute a representative 
subsample of the sample, and so the complete-case ana- 
lysis is valid. However, when data are MAR, using only 
complete cases can yield biased parameter estimators. 
Furthermore, even when data are MCAR, this approach 
is inefficient, as it ignores information from incomplete 
cases. 

A method for handling missing data that gives valid 
inference under MAR and which is more efficient than 
just using complete cases is multiple imputation (MI) 
[1]. Here a Bayesian model with non-informative prior 
is specified for the joint distribution of the variables in 
the analysis model, as well as possibly other ('auxiliary') 
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variables. This model is fitted to the observed data 
assuming that they are MAR. A single imputed dataset 
is now created by sampling the parameters of the impu- 
tation model from their posterior distribution, in order 
to account for the uncertainty in this model, and then 
randomly generating ('imputing') values for the missing 
data using these sampled parameter values in the speci- 
fied model. This procedure is repeated multiple times, 
so generating multiple imputed datasets, and then the 
analysis model is fitted to each of these in turn. Finally, 
the complete-data parameter and variance estimates 
from each imputed dataset are combined according to 
simple formulae called Rubin's Rules. Note that when 
some of the variables are fully observed, it is unneces- 
sary to model their distribution and the imputation 
model can be a model for the conditional distribution of 
the remaining variables given these. 

This article is concerned with the use of MI when the 
analysis model includes as covariates more than one 
function of the same variable and this variable can be 
missing. Such situations arise when the analysis model 
includes both linear and higher-order terms of the same 
variable or when the model includes an interaction term. 
This is the case, for example, when non-linear associa- 
tions are explored using fractional polynomials or splines 
[2]. In such situations, the imputation is complicated by 
the functional relationship between the covariates in the 
analysis model. In this article we focus on two particular 
simple settings: where the analysis model is 1) linear 
regression of an outcome Y on covariates X and X^, and 
2) linear regression of Y on covariates X, Z and XZ. 
These are the two settings considered by Von Hippel 
(2009) [3], who also investigated methods for imputing 
variables in the presence of higher-order or interaction 
effects. Unless stated otherwise, we suppose that Y and Z 
are fully observed and X can be missing. We investigate 
three methods of MI that can be easily implemented in 
standard software. 

In 'passive imputation', an imputation model is speci- 
fied for the distribution of X given Y (or X given Y and 
Z). Missing values of X are imputed from this model 
and the corresponding values of the function(s) {X or 
XZ) of X calculated. Von Hippel (2009) [3] called this 
method 'impute then transform'. In principle, there is 
nothing wrong with this method. However, in practice, 
the existence of the higher-order or interaction effects 
makes commonly used imputation models misspecified. 
The conditional distribution of X given Y (and Z) 
depends on the distribution of X (and Z) and the condi- 
tional distribution of Y given X (and Z). In the case of a 
linear regression analysis model, if X (and Z) are 
(jointly) normally distributed and the true coefficient of 
the higher-order or interaction term in the analysis 
model is zero, the conditional distribution of X given Y 



(and Z) is given by the linear regression of X on Y (and 
Z). If the coefficient is not zero, this is no longer so. 
Nevertheless, such a linear regression model would 
commonly be used in practice as an imputation model 
forX 

It is possible that passive imputation might be 
improved by using predictive mean matching (PMM) 
[4]. In this approach, rather than using the imputation 
model to generate missing X values directly, it is used to 
match subjects who have missing X with subjects with 
observed X. Each incomplete case's missing X is then 
imputed as the matching subject's value of X. The moti- 
vation for PMM is that it may be more robust to mis- 
specification of the imputation model and that grossly 
unrealistic imputed values are avoided, since every 
imputed value has actually been realised at least once in 
the dataset. 

Passive imputation and PMM ensure that the imputed 
values conform to the known functional relation between 
the covariates, e.g. that the imputed value of X is equal 
to the square of the imputed value of X. The third 
method of MI that we examine was recently proposed by 
Von Hippel (2009) [3]. This ignores the functional rela- 
tion between covariates and treats a higher-order or 
interaction term as just another variable. Von Hippel 
called this approach 'transform then impute'; following 
White et al. (2011) [5], we call it JAV ('Just Another Vari- 
able'). In this method, missing X and X^ (or XZ) are 
imputed under the assumption that Y, X and X (or Y, X, 
Z and XZ) are jointly normally distributed. Correspond- 
ing imputed values of X and X^ will not, in general, be 
consistent with one another, e.g. X may be imputed as 2 
while X is imputed as 5. However, Von Hippel argued 
that this does not matter for estimation of the parameters 
of the analysis model. We shall examine Von Hippel's 
argument in detail in the Results section. 

In the present article we investigate, using simulation, 
the performance of three methods easily implemented in 
standard software - passive imputation, PMM and JAV - 
in the two settings described above. We look at bias of 
parameter estimators and coverage of confidence inter- 
vals. In addition to considering linear regression analysis 
models, we also look at the logistic regression of binary Y 
on X and X . Von Hippel justified the use of JAV for a 
linear regression analysis model, but suggested that it 
might also work well in the setting of logistic regression, 
because the logistic link function is fairly linear except in 
regions where the fitted probability is near to zero and 
one. In the Methods section, we formally describe the 
three approaches and the simulations we performed to 
assess the performance of these approaches. We also 
describe a dataset from the EPIC study on which we illus- 
trate the methods. In the Results, we present a theoretical 
investigation of the properties of JAV, showing that 
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although JAV gives consistent estimation for linear 
regression under MCAR, it will not, in general, under 
MAR. Results from the simulations and from applying 
the methods to the EPIC dataset are also described there. 
These results are followed by a discussion and 
conclusions. 

Methods 

Three imputation methods 

We begin by describing passive imputation, PMM and 
JAV for the setting of linear regression of ^ on X and 
X . We then describe the modifications necessary for 
regression of ^ on X, Z and XZ. 

Let Xi and y, denote the values of X and Y, respec- 
tively, for subject i {i = !,...,«). Assume that {Xi, Yi),..., 
{X„, y„) are independently identically distributed. Let 
Ri = 1 if Xi is observed (i.e. if subject / is a complete 
case), with 7?, = 0 otherwise. Let «i denote the number 
of complete cases, and q denote the number of reg- 
ression parameters in the imputation model. Let 
X = (Xi, . . . ,Xnf, Wi=Ri (1, Yif (so W= (0, 0)^ whenever 
X is missing), W = (Wi, . . . , W„ f, and ^ = (W^W)"^- 

Passive imputation 

In the approach we call 'linear imputation model with 
passive imputation of X ' (or just 'passive imputation') 
the linear regression model X ~ N (yo + YiY, cr-^) is 
fitted to the complete cases. So, q = 2. Let 
Y = (j/Q, Ki) = i/fW^X denote the resulting maximum 
likelihood estimate (MLE) of K = (yo/Ki), and let 

&^ = J2" ^R<{^i-y^^i)^/{ni-^) denote the 
unbiased estimator of cr^. If Y and are treated as a 
priori independent with joint density proportional to a' 

^, then the posterior distribution of |(ni — (^) ff^j /cr^ is 

Xni-q and that of Y given is N(p, ira^)[6]. So, to 
create a single imputed dataset, cr*^ is drawn from 
{til - q) &^ / X^^^q and y* from N[y, ij/a*^). Then 
missing X values are imputed as Xj = y*^Wi +a*Bp 
where the £/s are independently distributed N{0, 1). 

PMM 

The approach we call 'linear imputation model with pre- 
dictive mean matching' (or just 'PMM') is the same as 
passive imputation up to the generation of o-*^ and y*. 
Thereafter, instead of generating )/*^W, + a*Bi, a fitted 
value X* = y*^W, is calculated for each subject. For 
each subject i with missing X, the K subjects with 
observed Xj and the closest X* values to his or her X* 
value are identified. One of these K subjects is chosen at 



random and his or her X, value becomes the imputed 
value of X,. The square of the imputed value of X, 
becomes the imputed value of X?. The value of K is 
chosen to balance bias in parameter and variance esti- 
mation. If K is very large, matching is very loose, leading 
to bias in parameter estimates of the analysis model. If 
K is very small, uncertainty in the imputed data will not 
be fully represented, leading to underestimation of stan- 
dard errors when Rubin's Rules are applied. For our 
simulations we used K = 5. Notice that if, as in this 
case, the imputation model is a simple linear regression 

of X on Y, finding the subjects with the nearest X* 

values to X* is equivalent to finding the subjects with 

the nearest Yj values to Yi. If the imputation model con- 
tains more than one predictor, PMM may be quite dif- 
ferent from matching on the subjects with the nearest 

JAV 

In the JAV approach, {Y, X, )&) is assumed to be jointly 
normally distributed: 
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Expression (1) can equivalently be written as 
F"-N(/xi,ffu) (2) 
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where (for k = 2, "i) &ko = l^-k - l^iCiilan, &ki = cru/crn, 
m = cfkk - cflJau and T12 = cr23 - o-i20-i3/o^u- Having 
fitted model (1) to the observed data, a perturbation is 
added to the maximum likelihood estimates, in a similar 
way to that described above for the passive imputation 
method. Missing values of X and X^ are then generated 
from distribution (3) using the perturbed values of the 
parameters. As Y is fully observed, an alternative to fit- 
ting model (1) is just to fit model (3) directly. 

The methods described above need only minor adap- 
tion for the setting of linear regression of Y on X, Z and 
XZ. In passive imputation and PMM, the imputation 
model for X should include Z. Obvious choices 
are X ~ N (yo + Ki^ + K22, ff^) (so = Ri{l,Yi,Zif 
and 17 = 3) or X N {^yo + yiY + y2Z + y^YZ, cr-^) (so 
W, = Ri(l, Yi,Zi, YiZif and q = 4). The imputed value of 
X multiplied by the imputed individual's value of Z 
becomes the imputed value of XZ. In JAV, {Y, X, Z, 
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XZ), rather than {Y, X, X^), is assumed to be multivari- 
ate normally distributed. 

Simulation studies 

Linear regression witti quadratic term 

In all our linear regression simulation studies, a sample 
size of 200 was assumed and 1000 simulated datasets 
were created. For each simulated dataset, we generated 
200 X values from one of four distributions with mean 2 
and variance 1: normal, log normal, (shifted and scaled) 
beta, and uniform. For the log normal distribution, logX 

was generated from N ^og (^^^3.2^ , log (5/4) j ; X then 

has a coefficient of skewness of 1.63. For the (shifted 
and scaled) beta distribution, we generated Z ~ beta(l, 
10) and X= 12.05(Z-l/ll)+2; X then has a skewness of 
1.51. The outcome ^ was generated from N(2X+X^,(p), 
where <p was chosen to make the coefficient of determi- 
nation equal to 0.1, 0.5 or 0.8. Although values 
greater than 0.5 are uncommon in medical studies, we 
wanted also to investigate the performance of methods 
in extreme situations. The top two rows of Figure 1 
show, for normally and log-normally distributed X, a 
typical set of data generated in this way. 

Missingness was then imposed on these data. Let expit 
{x) = {l+exp(-;c)}' . Y was fully observed; two missing 
data mechanisms were assumed for X. For MCAR, each 
X was observed with probability 0.7, regardless of the 
values of X and Y. For MAR, the probability X was 
observed was expit(ao + oti^O. where ai=-l/SD(y) and 
0!o was chosen to make the marginal probability of 
observing X equal to 0.7. 

For the three methods, passive imputation ('Passive'), 
PMM and JAV, we used M = 5 imputations. We also 
carried out the complete-case analysis ('CCase') and the 
complete-data analysis ('CData'), i.e. before data 
deletion. 

Finally, we instead generated Y from A'^ ((X-2)^,q>), with 
(p chosen to make R^ = 0.1, 0.5 or 0.8. As the mean of X 
is 2, the quadratic relation between Y and X is now 
more obvious in such data (see Figure 1). 

Linear regression with interaction 

We focussed on normally and log-normally distributed 
covariates. Four bivariate distributions were assumed for 
the two covariates X and Z. In the first, X and Z were 
both independently distributed N{2, 1). In the second, 
they were generated from a bivariate normal distribution 
so that they both had marginal distribution N(2, 1) but 
Cor(X, Y) = 0.5. In the third, logX and logZ were inde- 
pendently distributed N ^log ^V3^^ , log (5/4)^, so 

that X and Z were independently log-normal each with 
mean 2 and variance 1. In the fourth, logX and logZ 



were generated from a bivariate normal distribution 
so that they both had marginal distribution 

N (log (Vsl^ , log (5/4) j but Cor((log X, log Z) = 0.5. 

Outcome Y was generated from N {X+Z+XZ, <p), where 
(p was chosen so that R^ = 0.1 or 0.5. 

Y and Z were fully observed; the same two missing 
data mechanisms were assumed for X as in 'Linear 
regression with Quadratic Term'. Two variations of pas- 
sive imputation were used: in the first ('Passivel'), the 
imputation model contained just Y and Z; in the second 
('Passive2'), the imputation also contained the interac- 
tion YZ. For PMM the imputation model also included 
YZ. Von Hippel [3] considered only Passivel. 

Logistic regression with quadratic term 

A sample size of 2000 was assumed and 1000 simulated 
datasets were created. This larger sample size was used 
because binary outcomes provide less information for esti- 
mating parameter values than do continuous outcomes. 
We used the same normal and log normal distributions 
for X as in 'Linear regression with Quadratic Term'. Binary 
outcomes Ywere generated from the model P(Y = 1\X) = 
expit(/Jo+2/?2-^+/^2-^^)- The value of /?2 was chosen to make 
the log odds ratio of Y for X = ?> versus X = \ equal to 
either 1 (/Ja = 1/12) or 2 (/Jj = 1/6). When X is normally 
distributed, this is the log odds ratio for the mean of X 
plus one standard deviation relative to the mean of X 
minus one standard deviation. The value of /Jq was chosen 
so that the marginal probability oiY = \ was either p = 0.1 
or p = 0.5. 

Y was fully observed; X was MCAR or MAR, with 
probability expit(o!o + CCiY) of being observed. For 
MCAR Ui = 0; for MAR Ui = - 2. In both cases Uq was 
chosen to give a marginal probability of observing Y of 
0.7. For passive imputation and PMM the imputation 
model was the linear regression of X on Y. 

Analysis of vitamin C data from EPIC Study 

EPIC-Norfolk is a cohort of 25,639 men and women 
recruited during 1993-97 from the population of indivi- 
duals aged 45-75 in Norfolk, UK [7]. Shortly after recruit- 
ment, study participants were invited to attend a health 
check at which a 7-day diet diary was provided for comple- 
tion over the next week. Blood samples were provided and 
have been stored. A measure of average daily intake of vita- 
min C has been derived from the 7-day diet diary and 
plasma vitamin C (^mol/1) was measured within a few days 
of the blood sample being provided. The dietary assess- 
ment methods have been described in detail elsewhere [8] . 

There is evidence of a non-linear association between 
vitamin C intake and plasma vitamin C [9]. Here, we 
look at this association in the EPIC-Norfolk data: in par- 
ticular, whether this relation is linear or has a quadratic 
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Figure 1 Typical datasets for normally or log-normally distributed X (each with mean 2 and variance 1), normally distributed V with 
mean 2X + or (X - 2)^ and = 0.1, 0.5 or 0.8. Dotted line shows expected value of Y given X. 



element. Plasma vitamin C is also affected by sex, age, 
smoking status, and body size [9-12], so these possible 
confounders are adjusted for in our analysis. The analy- 
sis presented in this article illustrates the methods 



described here and is not intended as a definitive analy- 
sis of the EPIC data. 

Of the 25639 subjects, 10224 had incomplete data: 
3165 had missing plasma vitamin C; 8100 missing 
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dietary vitamin C; 32 missing weight; 220 missing smok- 
ing status (age and sex were fully observed). If the data 
are not MCAR, estimators from a complete-case analysis 
may be biased. When logistic regression was used on 
the set of subjects with observed plasma vitamin C, 
higher values of log plasma vitamin C were associated 
with a lower probability of being a complete case (p = 
0.03). This suggests the data are not MCAR. Further- 
more, the complete-case analysis ignores information on 
individuals with observed outcome plasma vitamin C 
but one or more missing covariates. For these two rea- 
sons, we applied MI to these data. 

Three forms of MI were used. In the first and second 
forms, the full-conditional specification (FCS: also known 
as 'chained-equations') approach was used [13]. This 
involves cycling through the variables, imputing missing 
values in each variable in turn using a model for the dis- 
tribution of that variable given all the other variables 
except vitamin C-squared. In the first form of MI, a lin- 
ear regression model was used to impute each variable 
except dietary vitamin C-squared and smoking status. 
Multinomial logistic regression was used for smoking sta- 
tus and dietary vitamin C-squared was passively imputed 
from dietary vitamin C. If all variables except dietary vita- 
min C had been observed, this approach would be the 
same as what we called 'linear imputation model with 
passive imputation of X ' in the simulations. The second 
form of MI was identical to the first except that PMM 
was used to impute dietary vitamin C. The third form of 
MI was JAV, i.e. imputation under a multivariate normal 
distribution. For JAV, smoking status was represented by 
two binary indicator variables, one of which was equal to 
1 for former smokers, the other of which equalled 1 for 
never smokers. Imputed values for these binary variables 
were not rounded. This method of handling categorical 
variables was advocated by Ake [14] and, in view of the 
small proportion of missing values in the smoking vari- 
able, should be adequate. For all three MI methods, the 
3165 subjects with missing plasma vitamin C were used 
for imputation but were deleted from the dataset before 
fitting the analysis model. When, as in this case, the same 
set of variables are used in the imputation as in the analy- 
sis, subjects with missing responses can provide informa- 
tion about the joint distribution of the covariates and 
hence about missing covariate values in subjects with 
observed responses, but do not otherwise carry informa- 
tion about the parameters of the analysis model [15]. 
Deletion of such subjects after imputation reduces the 
Monte Carlo error caused by having a finite number of 
imputed datasets. Standard errors were estimated for 
each imputed dataset using the robust variance estimator 
[16], as there was evidence of heteroskedacity (see 
Results). 



We also applied a variant of JAV in which FCS was 
used. This variant was identical to the first form of MI 
except that the dietary vitamin C-squared variable was 
imputed using a linear regression model involving all 
the other variables (including dietary vitamin C) as cov- 
ariates. This is equivalent to imputing dietary vitamin C 
and dietary vitamin C-squared from a bivariate normal 
distribution conditional on all the other variables. 

In all our analyses smoking status was categorised as 
current smoker (baseline), former smoker or never smo- 
ker. The first and second forms of MI and the variant 
JAV method were implemented using ice in STATA. 
The original JAV method was implemented using mi 
impute mvn in STATA. 

Results 

Properties of JAV under MCAR and MAR 

In this section, we summarise the argument of Von Hippel 
(2009) for why the JAV approach will give consistent esti- 
mation of the parameters of the analysis model when the 
data are MAR, and then explain why JAV actually requires 
the stronger condition of MCAR for consistency. 

Assume that the analysis model is the regression of Y on 
X and X (the argument is analogous for the interaction 
model). Model (1), or equivalently (2)-(3), is misspecified, 
since (Y, X, X ) is not joint normally distributed. The 
values of ^1, an, etc. that minimise the Kullback-Leibler 
distance between the true distribution oi {Y, X, X ) and 
the multivariate normal distribution are called their 'least 
false' values [17]. The least false values of fii, ^2 and fi3 are 
just the population means oi Y, X and X^; those of (Tn,..., 
are the population variances and covariances. If the 
missing X and X^ values are imputed from distribution (3) 
with A = ((52o,f52i, ^30. (>3i, ^22. ^23, T33.) equal to its least 
false value, then the mean and variance of {Y, X, X^) in the 
imputed dataset will consistently estimate the mean and 
variance in the population. The true parameter values of 
the analysis model are functions of this population mean 

and variance: they are {£ (L/iL/^)}~^£ (fJiYi) , where 
Ui = (l,Xi,Xj)^- Therefore, if missing X and X^ values 
are imputed using the least false values, the parameters of 
the analysis model will be consistently estimated. 

Von Hippel argues that, when X is MAR, the least false 
values of A can be consistently estimated from the 
observed data. Von Hippel proposes that therefore the 
missing data can be imputed using the assumption that 
{Y, X, X^) is jointly normally distributed (model (1)). He 
does this using PROC MI in SAS. 

When the data are MCAR, the above argument is 
valid. However, when the data are MAR, the observed- 
data MLEs of A are not necessarily consistent for the 
least false value, because model (1) is misspecified. 
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Consequently, the analysis model parameters will not, in 
general, be consistently estimated unless X is MCAR. In 
more detail, the argument is as follows. 

It can be seen from expressions (2) and (3) that the 
MLEs of fii and an are functions only of the 1^ values of 
everyone in the sample and do not depend on X, while 
the MLE of A is a function of the Y, X and X values of 
subjects for whom X is observed. As i'is fully-observed, 
the MLEs of and an consistently estimate their least 
false values. If X is MCAR, the set of individuals for 
whom X is observed is a simple random subsample from 
the sample, and hence also a simple random sample from 
the population. So, A is also consistently estimated by its 
MLE. The rest of Von Hippel's argument now applies. 
On the other hand, if X is MAR, the individuals with 
observed X are not a simple random subsample of the 
sample, and hence represent a random sample from 
another population different from that from which the 
actual sample was drawn. In this other population, the 
least false value of A will generally be different [17,18]. 
The maximum likelihood estimator of A will consistently 
estimate the least false value in this other population. It 
follows that, in general, JAV will give inconsistent estima- 
tion of the parameters in the analysis model. 

The argument that JAV will give consistent estimation 
when X is MCAR can be straightforwardly extended to 
allow for a vector of additional variables, S say. In this 
case, the normal distribution for {Y, X, X^) in equation 
(1) is extended to a normal distribution for {Y, X, X^, S). 



S may include both variables that are additional covari- 
ates in the analysis model and variables that are not in 
the analysis model (i.e. auxiliary variables). If {X, S) are 
MCAR, JAV will give consistent estimation of the para- 
meters of the analysis model. 

So far, we have been concerned with parameter estima- 
tion. Now consider variance estimation. Von Hippel uses 
Rubin's Rules to estimate variances and hence confidence 
intervals. However, there is no particular reason to assume 
that Rubin's Rules will give a consistent variance estimator, 
because derivations of Rubin's Rules assume a correctly 
specified parametric imputation model [19-21] and the 
multivariate normal model is misspecified. In our simula- 
tions, we investigate both the bias in parameter estimators 
and the coverage of confidence intervals calculated using 
Rubin's Rules. 

Simulation studies 

Linear regression witti quadratic term 

We focus on the quadratic term, whose true value is 1. 
The first block of five rows of Table 1 shows the bias and 
95% coverage for the five methods when X is normally dis- 
tributed and MCAR. Also shown is the relative precision, 
i.e. the ratio of the empirical variance of CData to those of 
the other estimators. When both estimators contributing 
to this ratio are unbiased, it is the relative efficiency. The 
maximum (over the five methods) Monte Carlo standard 
errors (MCSE) associated with the estimated biases are 
reported in the table legend. The estimated biases of 



Table 1 Linear regression with Y ~ N {2X+X^, (p) 
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Table 1 Percentage bias, coverage and relative precision for quadratic term in linear regression when Y ~ N {2X+X^, (p). The true value of the quadratic term is 1. 
For MCAR, X ~ normal, the maximum MCSEs among the five methods are 4, 1 and 1% for = 0.1, 0.5 and 0.8, respectively. For MAR, X ~ normal, they are 5, 2 
and 1%. For MAR, X ~ log normal, they are 7, 4 and 3% 
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CData (-3%, -1%, 0%; MCSEs 3%, 1% and 1%) and CCase 
i-2%, -1%, 0%; MCSEs 4%, 1%, 1%) are consistent with 
these estimators being unbiased, as is expected. CData and 
CCase both have estimated coverage 95%, again as 
expected. Due to Monte Carlo variation, the variance of 
CData is sUghtly less than the expected 0.7 times that of 
CCase. JAV is unbiased, as expected; its coverage is 
approximately correct. JAV is, however, slightly less effi- 
cient than CCase. This difference in efficiency narrowed to 
1% when the number of imputations was increased to M = 
30 (data not shown). Note that if other non-fuUy observed 
variables, V say, were included as covariates in the analysis 
model along with X and X , JAV might well be more effi- 
cient than CCase, because CCase would not use data on 
individuals with observed X but missing V, whereas JAV 
would. Passive underestimates the quadratic effect by 
20-30%; coverage is too high when R = 0.1 and too low 
when R = 0.8. PMM is approximately unbiased, but may 
have slight under-coverage. The underestimation by 
Passive of the quadratic effect accords with the finding of 
Von Hippel (2009), who appUed Passive and JAV to a real 
dataset. He found that the estimate of the quadratic effect 
from Passive was closer to zero than that from JAV. 

The above observations remain broadly true when X is 
log-normal, beta or uniform (data not shown). 

The second block of five rows of Table 1 shows bias, 
coverage and relative precision when X is normally distrib- 
uted and MAR. Unlike in the MCAR case, CCase, PMM 
and JAV are now biased. Of the (non-complete data) 
methods, JAV has the smallest bias, and in no case is it 
greater than 12% (MCSE 4%); its coverage is approxi- 
mately correct. We do not discuss relative precisions: they 
are not very meaningful in the presence of bias. The bias 
in PMM probably arises because the largest missing value 
(or values) of X tends to be larger than the largest 
observed value of X, because the probability that X is 
observed decreases as Y increases and individuals with lar- 
ger lvalues tend to have larger X values. This means that 
the imputed values of X tend to be lower than their corre- 
sponding missing true values. 

The third block of five rows of Table 1 shows the 
results when X is log-normally distributed and MAR. 
When R = 0.1, CCase is biased and JAV approximately 
unbiased. However, when R^ = 0.8, JAV is considerably 
more biased than CCase. There is some evidence of slight 
undercoverage of JAV. When X is beta distributed, JAV is 
approximately unbiased when R^ = 0.1, but biased (bias = 
20%) when R^ = 0.8. JAV is approximately unbiased with 
coverage 93% when X is uniformly distributed (data not 
shown). PMM had large bias for all three values of R . 

Increasing the sample size to N = 5000 reduces the 
bias (to < 4%) of PMM when X is normally distributed 
and MAR (data not shown). This is probably because a 
larger sample size enables closer matches to be found 



for individuals with missing X. However, there was no 
consistent improvement when X was log-normally dis- 
tributed. Although the bias reduced from -46% to -12% 
when R^ = 0.1, it stayed the same when R^ = 0.8 and 
increased from -19% to 38% when R = 0.5. The biases 
of Passive and JAV are not improved. Coverage of 
PMM, Passive and JAV worsen, especially when 7? = 
0.8 or X is log-normally distributed: when X is normally 
distributed and R^ = 0.8, coverages were 0%, 67% and 
88%, respectively. We also investigated whether the bias 
of PMM that was still evident for a sample size of N = 
5000 when X was log normally distributed was reduced 
by increasing N to 50000. We found that, although this 
did happen, the bias was still 15% when R^ = 0.5 and 
26% when R^ = 0.8. This is probably because the differ- 
ence between the largest missing value (or values) of X 
and the largest observed value of X could continue to 
be quite large, even with a very large sample size, when 
X is log normally distributed and the missingness 
mechanism means that very large values of X are very 
likely to be missing. 

Table 2 shows the results when Y~N {{X-2f, (p). JAV is 
approximately unbiased when X is MCAR. Estimated cov- 
erage is at least 92% when X is normally or uniformly dis- 
tributed. However, when R^ = 0.8 and X is MCAR and 
log-normally or beta distributed, coverage is only 83% 
(data not shown). The coverage deteriorates as the sample 
size increases: 62%, 71% and 62% when N = 5000 and X is 
log-normally distributed and R^ = 0.1, 0.5 or 0.8, respec- 
tively. When X is MAR and normally distributed, the bias 
of JAV is 18% (coverage 68%) when R^ = 0.5 and 22% 
(coverage 19%) when R^ = 0.8. The bias is 41% or 71% 
when X is MAR and log-normally distributed with R^ = 
0.5 or 0.8, respectively. These biases do not improve as 
sample size increased. The performances of JAV and 
PMM are similar when X is MCAR; JAV can be better or 
worse than PMM when X is MAR. When the sample size 
is increased and X is normally distributed, the bias of 
PMM diminishes (the maximum bias of PMM when N = 
5000 is 6%), but that of JAV does not; when X is log- 
normally distributed, neither bias diminishes for all values 
ofR". 

Linear regression with interaction 

We focus on the interaction term, whose true value is 1. 
Table 3 shows the results when X and Z are independent. 
It can be seen that Passivel is heavily biased even when 
X is MCAR. Including the YZ term in the imputation 
model (Passive2) usually, but not always reduces the bias, 
but it remains substantial. PMM offers little or no 
improvement over Passive 2. JAV performs much better, 
being approximately unbiased. Coverage of JAV is good, 
although there is evidence of slight undercoverage: its 
coverage is consistently less than that of CCase when X is 
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Table 2 Linear regression with Y ~ N ((X-2)^ (p) 
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Table 2 Percentage bias, coverage and relative precision for quadratic term in linear regression when Y ~ N {(X-2)^, (p). For IVICAR, X ~ normal, the maximum 
MCSEs among the five methods are 1, 0 and 0% for = 0.1, 0.5 and 0.8, respectively. For MAR, X ~ normal, they are 1, 1 and 1%. For MAR, X ~ log normal, they 
are 2, 2 and 2% 



Table 3 Linear regression with interaction 
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Table 3 Percentage bias, coverage and relative precision for interaction term in linear regression. The true value of the interaction term is 1. For MCAR, X, Z ~ 
normal, the maximum MCSEs are 4, 1 and 1% for = 0.1, 0.5 and 0.8, respectively. For MAR, X, Z ~ normal, they are 4, 2 and 1%. For MAR, X, Z ~ log normal, 
they are 6, 3 and 1% 
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MAR. The underestimation by Passive of the interaction 
effect accords with the finding of Von Hippel (2009) in a 
real dataset that the estimate from Passive was closer to 
zero than that from JAV. 

When X and Z are correlated (results not shown), the 
biases of passive imputation and PMM change, but 
remain larger than those of JAV, whose biases (and cov- 
erages) change little. PMM is still less biased than passive 
imputation when X is MCAR, but not necessarily when X 
is MAR. For example, when X and Z are (marginally) 
normally distributed and = 0.1, the biases of Passivel, 
Passive2 and PMM are 24%, -16% and 24%, respectively; 
when X and Z are log normally distributed and = 0.1, 
they are -29%, -37% and -64%. 

Logistic regression with quadratic term 

We focus on the quadratic term, whose true value is = 
1/12 or 1/6. Table 4 shows the results. Consider first the 
results when X is MCAR. Passive has large bias. PMM is 
approximately unbiased, but slightly less efficient than 
CCase. In fact, because Y is binary, and so each missing 
X value is replaced by an observed X from a randomly 
chosen individual with the same value of Y, PMM is 



equivalent to CCase but with each complete case receiv- 
ing a random weight. As these weights are (over repeated 
samples) uncorrelated with Y and X, they do not cause 
bias, but do introduce stochastic variation that is not pre- 
sent in CCase. When p = 0.5, = 1/2 and X is normally 
distributed, JAV performs well. This is in conformity 
with the hypothesis of Von Hippel: P{Y = l\X) is fairly 
close to 0.5 over most of the distribution of X, and so the 
logistic function is an approximately linear function of 
the linear predictor over most of the distribution of X. 
When p = O.l, = 1/6 or X is log normally distributed, 
however, JAV performs much worse and in one case 
changes direction: now P(Y = l\X) is closer to zero or 
one for more individuals in the population. 

Now consider the results when X is MAR. When p = 
0.5, the CCase is approximately unbiased with correct 
coverage. This is because the complete cases can be 
regarded as a sample from a case-control study: the prob- 
abilities that individuals are sampled depends on their 
outcomes (Y) but not their exposures (X). As is well 
known, valid inference can be obtained from case-control 
study data using ordinary logistic regression and treating 
Y as the outcome [22] . PMM is approximately unbiased 



Table 4 Logistic regression with quadratic term 
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Table 4 Percentage bias, coverage and relative precision for quadratic term in logistic regression. For MCAR, X ~ normal, the maximum IVlCSEs are 2, 1 and 3% 
for (p, y32)=(0-5,l/12), (0.5, 1/6) and (0.1, 1/12), respectively. For MCAR, X ~ log normal, they are 2, 2 and 2%. For MAR, X ~ normal, they are 2, 1 and 4%. For MAR, 
X ~ log normal, they are 2, 2 and 6% 
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with correct coverage for the same reason. Note that 
when p = 0.1, = 1/12 and X is normally distributed, 
even CCase and PMM are subject to finite sample bias. 
This is because although the sample size is 2000, and so 
the expected number of cases is 200, the expected num- 
ber with observed X is only 47 under this MAR mechan- 
ism. This would seem to be too few to unbiasedly 
estimate the quadratic effect. The bias of JAV can be very 
large: up to 56% when X is normally distributed and 
333% when X is log normally distributed. Unsurprisingly, 
its coverage can also be very poor. 

Analysis of vitamin C data from EPIC study 

Figure 2 is a plot of log plasma vitamin C (^mol/l) 
against log dietary vitamin C (mg/day). There is a sug- 
gestion of a quadratic effect: the gradient appears 
to diminish as the dietary vitamin C value increases. 
Table 5 shows the estimates from a linear regression in 
the complete cases of log plasma vitamin C on log diet- 
ary vitamin C and the confounders. Smoking status is 
categorised as current smoker (baseline), former smoker 
or never smoker. As Figure 2 shows, there is evidence of 
heteroskedasticity. The quadratic effect is highly signifi- 
cant and negative, according with the apparent dimin- 
ishing gradient. 

Table 5 shows the estimates for three of the four MI 
methods. The results from the variant JAV method are 
almost identical to those from JAV, and so are not 
reported here. It can be seen that the point estimates 
from JAV are very similar to those from the complete- 
case analysis. The estimates from the FCS with passive 
imputation method are quite different from those of JAV: 
estimated quadratic and linear effects of log dietary vita- 
min C are stronger. This difference is much reduced 
when PMM is used instead of passive imputation, sug- 
gesting that the linear model used to impute missing 
dietary vitamin C values in the passive method may have 
induced a bias which PMM has reduced. Alternatively, it 
is possible that passive imputation is giving less biased 
estimates than JAV and the complete-case analysis, but 
this seems unlikely in view of the simulation results pre- 
sented earlier. It is perhaps surprising that the passive 
approach has yielded a larger estimated quadratic effect 
than the complete-case analysis; in the simulations the 
converse was true. This shows that the conclusions from 
our simulation studies may not apply when there is 
heteroskedasticity. 

The SEs when using MI are somewhat smaller than 
those from the complete-case analysis, indicating that 
MI has made use of information from the subjects with 
missing data. This gain in efficiency is greater for JAV 
than for PMM. Had there been more missing values in 
the confounders, we would have expected a greater effi- 
ciency gain from using MI. 



Discussion 

In this article, we have investigated imputation of an 
incomplete variable when the model of interest includes 
as covariates more than one function of that variable. We 
have focused on linear regression with a quadratic or 
interaction term, and have examined three imputation 
methods that can be easily implemented in standard soft- 
ware. In STATA, for example, the ice command can be 
used for passive imputation and PMM, and the mi 
impute mvn command for JAV; in R the mice function 
can be used for passive imputation and PMM, and the 
mix library for JAV. Note that although ice and mice use 
chained equations and hence, in general, involve itera- 
tion, when the data are monotone missing, as is the case 
in our simulation studies, no iteration is required. 

In the JAV approach, each function of the incomplete 
variable is treated as an unrelated variable and a multi- 
variate normal imputation model is used. Von Hippel 
(2009) claimed that this would give consistent estimation 
for linear regression when the data were MAR. In this 
paper we have shown that the consistency actually 
requires MCAR; when data are MAR, bias is to be 
expected. None of the three MI methods we investigated 
worked well in all the MAR scenarios considered. In gen- 
eral, JAV performed better than passive imputation or 
PMM for linear regression with a quadratic or interaction 
effect. We have shown, however, that there are circum- 
stances in which JAV can have large bias for the quadra- 
tic effect of a linear regression model. JAV was found to 
perform very badly when the analysis model is a logistic 
regression, unless the outcome is common and covariates 
only have small effects on its probability. In view of this, 
we recommend that, given the current state of available 
software, JAV is the best of a set of imperfect imputation 
methods for linear regression with a quadratic or interac- 
tion effect, but should not be used for logistic regression. 
For logistic regression, the best performing imputation 
method was PMM. However, when X (and X^) are the 
only covariates in the model and are MAR, the complete- 
case analysis is unbiased, and hence we recommend its 
use in that case. 

In our simulations, we found that using PMM was 
nearly always better than using passive imputation with- 
out PMM. However, for linear regression analysis mod- 
els, its performance was usually worse than JAV. 

In the scenarios we considered in our simulations, the 
analysis model only involves one variable (X) and its 
square, or two variables {X and Z) and their interaction. 
We have not considered additional covariates. We lim- 
ited our investigation to these simple cases, because it is 
important to understand the performance of the meth- 
ods in these scenarios before moving on to more com- 
plicated scenarios. Further research is needed into the 
performance of the methods when additional covariates 
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Log dietary vitamin C (micrograms/day) 
Figure 2 Log plasma vitamin C and log dietary vitamin C in 15415 individuals for whom both variables are observed 



are involved. In the scenarios we considered, the com- 
plete-case analysis is unbiased when the data are MCAR 
and is actually more efficient than all of the imputation 
methods considered. However, MI will be more efficient 
than just using complete cases when the analysis model 
involves additional non-fuUy observed covariates, 
because individuals with observed X and Y but missing 
other covariates will be excluded from the complete- 
case analysis but do contribute information when MI is 



used. Furthermore, the complete-case analysis may be 
biased when the data are not MCAR. 

We (like Von Hippel, 2009) have presented JAV as a 
method using a multivariate normal imputation model. 
If the data are MCAR, JAV with this imputation model 
will give consistent point estimation in linear regression. 
The principle of JAV, i.e. that functions of the same 
variable are treated as separate and the functional rela- 
tion between them ignored, is not tied to the normal 



Table 5 Analysis of vitamin-C data 
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Est 


SE 


Est 


SE 


Est 


SE 


Est 


SE 


intercept 


0.990 


0.201 


0.570 


0.177 


0.903 


0.181 


1.030 


0.163 


log diet C 


1.141 


0.090 


1.322 


0.079 


1.163 


0.081 


1.106 


0.075 


log diet C sqrd 


-0.090 


0.010 


-0.113 


0.009 


-0.094 


0.009 


-0.088 


0.008 


sex 


0.169 


0.008 


0.173 


0.007 


0.172 


0.007 


0.172 


0.007 


weiglit (per 10 Kg) 


-0.042 


0.003 


-0.041 


0.003 


-0.040 


0.003 


-0.041 


0.003 


age (per 10 yrs) 


-0.052 


0.004 


-0.043 


0.003 


-0.043 


0.003 


-0.043 


0.003 


former smoker 


0.212 


0.015 


0.213 


0.012 


0.213 


0.012 


0.212 


0.012 


never smol<er 


0.216 


0.014 


0.218 


0.012 


0.218 


0.012 


0.219 


0.012 



Table 5 Point estimates and SEs from complete-case analysis and three Ml methods (full conditional specification with and without predictive mean matching, 
and JAV) for the regression of log plasma vitamin C on log dietary vitamin C ('log diet C), its square ('log diet C sqrd') and a set of confounders 
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distribution. However, the properties of a method using 
the JAV principle with another imputation model are 
thus far unknown. 

Conclusions 

JAV gives consistent estimation for linear regression 
with a quadratic or interaction term when data are 
MCAR, but may be biased when data are MAR. The 
bias of JAV can be severe when used for logistic regres- 
sion. JAV is the best of a set of imperfect methods for 
linear regression with a quadratic or interaction effect, 
but should not be used for logistic regression. 
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